Context¶

  • Why is this problem important to solve?

Understanding customer behavior and characteristics is a critical part of the marketing operations of any business or organization, with direct consequences for sales and marketing strategy. Customer segmentation is often viewed as a means to achieve a better return on investment from marketing efforts: it helps organizations use their money, time, and other critical resources more efficiently by tailoring marketing strategies to different groups of customers based on their unique needs and motivations.

In the context of marketing analytics, customer segmentation plays a vital role in optimizing ROI. It typically involves analyzing metrics around customer engagement with various marketing activities, including but not limited to ATL (above the line) marketing, BTL (below the line) campaigns, and personalized offers. The variables of interest are typically customer profiles, campaign conversion rates, and information associated with various marketing channels. Based on these feature categories, the goal is to create the best possible customer segments from the given data.

Research has repeatedly shown that customer segmentation has a large impact on email engagement. Segmented campaigns often see over 100% more clicks than non-segmented campaigns, and email marketers who segmented their audience before campaigning have reported 6-7 times growth in overall revenue. It has also been widely observed that today's customers prefer personalized communications and offerings that cater to their particular interests.

The objectives:¶

  • What is the intended goal?

Using Unsupervised Learning ideas such as Dimensionality Reduction and Clustering, the objective is to come up with the best possible customer segments using the given customer dataset.

The key questions:¶

  • What are the key questions that need to be answered?

The key questions that need to be answered are:

  • How do we create customer segments from the given dataset so that similar customers can be targeted together in marketing campaigns?
  • Which variables are important for distinguishing customers based on their behavior and characteristics in the given dataset?
  • Which variables can be ignored, combined, or cleaned without impacting the analytical results?

The problem formulation:¶

  • What is it that we are trying to solve using data science?

We are trying to group similar customers together or create customer segments using known unsupervised learning algorithms. We will be solving the given problem using best practices and techniques of dimensionality reduction and clustering. Using data science, we will be seeking answers to the following questions.

  • What is the correlation between the different variables in the given dataset?
  • How do we reduce the dimensionality of the dataset without impacting the analytical results?
  • What methodology should we use for dimensionality reduction?
  • What methodology should we use for clustering, i.e., creating customer segments?

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: Amount spent on fish products in the last 2 years
  11. MntMeatProducts: Amount spent on meat products in the last 2 years
  12. MntFruits: Amount spent on fruit products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: Amount spent on wine products in the last 2 years
  15. MntGoldProds: Amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Important Notes¶

  • This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook will give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.

  • In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

  • The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

  • All the outputs in the notebook are just for reference and can be different if you follow a different approach.

  • There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.

Milestone 1¶

Loading Libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Importing PCA
from sklearn.decomposition import PCA

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To compute distances
from scipy.spatial.distance import pdist

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

Let us load the data¶

In [2]:
# loading the dataset
data = pd.read_csv("marketing_campaign.csv")

Check the shape of the data¶

In [3]:
# Print the shape of the data
data.shape
Out[3]:
(2240, 27)

Observations and Insights:¶

  • There are 27 columns and 2240 observations in the dataset.

Understand the data by observing a few rows¶

In [4]:
# View first 5 rows
data.head()
Out[4]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [5]:
# View last 5 rows Hint: Use tail() method
data.tail()
Out[5]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 3 4 5 0 0 0 0 0 0 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 2 5 7 0 0 0 1 0 0 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 3 13 6 0 1 0 0 0 0 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 5 10 3 0 0 0 0 0 0 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 1 4 7 0 0 0 0 0 0 1

5 rows × 27 columns

Observations and Insights:¶

  • Only one column, Income, has null values.
  • 24 columns are numerical and 3 are categorical.
  • Marital_Status and Education have a limited number of unique values and can be treated as categorical variables.

Let us check the data types and missing values of each column¶

In [6]:
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [7]:
# Find the percentage of missing values in each column of the data
for col in data.columns:
    print(col, "-", data[col].isnull().mean() * 100, "%")
ID - 0.0  %
Year_Birth - 0.0  %
Education - 0.0  %
Marital_Status - 0.0  %
Income - 1.0714285714285714  %
Kidhome - 0.0  %
Teenhome - 0.0  %
Dt_Customer - 0.0  %
Recency - 0.0  %
MntWines - 0.0  %
MntFruits - 0.0  %
MntMeatProducts - 0.0  %
MntFishProducts - 0.0  %
MntSweetProducts - 0.0  %
MntGoldProds - 0.0  %
NumDealsPurchases - 0.0  %
NumWebPurchases - 0.0  %
NumCatalogPurchases - 0.0  %
NumStorePurchases - 0.0  %
NumWebVisitsMonth - 0.0  %
AcceptedCmp3 - 0.0  %
AcceptedCmp4 - 0.0  %
AcceptedCmp5 - 0.0  %
AcceptedCmp1 - 0.0  %
AcceptedCmp2 - 0.0  %
Complain - 0.0  %
Response - 0.0  %

Observations and Insights:¶

  • Most of the columns have a numeric datatype; the exceptions are "Education", "Marital_Status", and "Dt_Customer".
  • Only one column, "Income", has missing values (~1.07%).

We can observe that ID has no null values, and the number of unique ID values equals the number of observations. So ID looks like an index for the data entries; such a column provides no predictive power for our analysis and can be dropped.
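The uniqueness claim above can be checked directly. A self-contained sketch on a toy frame with made-up values (in the notebook, the same check would be run on `data`):

```python
import pandas as pd

# Sketch: verify that ID is unique per row before dropping it
# (toy frame with made-up values; the notebook would run this on `data`)
toy = pd.DataFrame({"ID": [5524, 2174, 4141],
                    "Income": [58138.0, 46344.0, 71613.0]})

# If every ID occurs exactly once, it is just a row identifier
is_row_id = toy["ID"].nunique() == toy.shape[0]
print(is_row_id)
```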

Dropping the ID column

In [8]:
# Remove ID column from data. Hint: Use inplace = True
data.drop(columns = ["ID"], inplace= True)

Exploratory Data Analysis¶

Let us now explore the summary statistics of numerical variables¶

In [9]:
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.
data.describe().T
Out[9]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

Observations and Insights:¶

  • The maximum values of most variables are far above their 75th percentiles, indicating that there might be outliers in the data.
  • The standard deviations of Income, MntWines, and MntMeatProducts are very high, which means these variables vary a lot across customers.
  • All offer-acceptance variables - AcceptedCmp1 through AcceptedCmp5 - are mostly 0, which means very low acceptance of offers by customers. Similarly, Complain and Response are mostly 0.
  • The amounts spent on all products have outliers. There might be a few customers interested in higher-priced products, or a few high-selling products. This can be analyzed further.
  • Year_Birth ranges from 1893 to 1996. Since the data was collected in 2016, some customers would be more than 100 years old; such entries are likely anomalies, as those customers may no longer be alive.

Let us also explore the summary statistics of all categorical variables and the number of unique observations in each category¶

In [10]:
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]

Number of unique observations in each category

In [11]:
for column in cols:
    print("Unique values in", column, "are :")
    print(data[column].value_counts(normalize=True))
    print("*" * 50)
Unique values in Education are :
Graduation    0.503125
PhD           0.216964
Master        0.165179
2n Cycle      0.090625
Basic         0.024107
Name: Education, dtype: float64
**************************************************
Unique values in Marital_Status are :
Married     0.385714
Together    0.258929
Single      0.214286
Divorced    0.103571
Widow       0.034375
Alone       0.001339
Absurd      0.000893
YOLO        0.000893
Name: Marital_Status, dtype: float64
**************************************************
Unique values in Kidhome are :
0    0.577232
1    0.401339
2    0.021429
Name: Kidhome, dtype: float64
**************************************************
Unique values in Teenhome are :
0    0.516964
1    0.459821
2    0.023214
Name: Teenhome, dtype: float64
**************************************************
Unique values in Complain are :
0    0.990625
1    0.009375
Name: Complain, dtype: float64
**************************************************

Observations and Insights:¶

  • In Education, "2n Cycle" and "Master" mean the same thing, so these two categories can be combined.
  • There are many categories in Marital_Status. The categories 'Alone', 'Absurd', and 'YOLO' can be combined with 'Single'.
  • The most common education level is Graduation.
  • The most common marital status is Married.
  • Less than 1% of customers (~0.9%) complained in the last two years.

Think About It:

  • We could observe from the summary statistics of categorical variables that the Education variable has 5 categories. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  • Similarly, there are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?

Let us replace the "2n Cycle" category with "Master" in Education and "Alone", "Absurd, and "YOLO" with "Single" in Marital_Status¶

In [12]:
# Replace the category "2n Cycle" with the category "Master"

data["Education"].replace("2n Cycle", "Master", inplace = True)  # Hint: Use the replace() method and inplace=True
In [13]:
# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single"

data["Marital_Status"].replace(["Alone","Absurd", "YOLO"], "Single", inplace = True) # Hint: Use the replace() method and inplace=True

Univariate Analysis¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data¶

Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.

Let us plot histogram for the feature 'Income' to understand the distribution and outliers, if any.¶

In [14]:
# Create histogram for the Income feature

plt.figure(figsize=(15, 7))
sns.histplot(x="Income", data=data)
plt.show()

We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is more suitable for identifying extreme values in the data.

In [15]:
# Plot the boxplot
sns.boxplot(data=data, x="Income", showmeans=True, color="violet")
Out[15]:
<AxesSubplot:xlabel='Income'>

Observations and Insights:¶

  • As seen from the boxplot above, there are a few outliers in the Income data. We need to analyze further whether these outliers should be removed from the dataset.

Think About It

  • The histogram and the box plot show some extreme values on the right side of the distribution of the 'Income' feature. Should we treat them as outliers and remove them, or should we analyze these extreme values first?
In [16]:
# Calculating the upper whisker for the Income variable

Q1 = data.quantile(q=0.25)                          # Finding the first quartile

Q3 = data.quantile(q=0.75)                          # Finding the third quartile

IQR = Q3 - Q1                                       # Finding the Inter Quartile Range

upper_whisker = (Q3 + 1.5 * IQR)["Income"]          # Calculating the Upper Whisker for the Income variable

print(upper_whisker)                                # Printing Upper Whisker
118350.5
In [17]:
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
Out[17]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
164 1973 PhD Married 157243.0 0 1 01-03-2014 98 20 2 ... 22 0 0 0 0 0 0 0 0 0
617 1976 PhD Together 162397.0 1 1 03-06-2013 31 85 1 ... 0 1 1 0 0 0 0 0 0 0
655 1975 Graduation Divorced 153924.0 0 0 07-02-2014 81 1 1 ... 0 0 0 0 0 0 0 0 0 0
687 1982 PhD Married 160803.0 0 0 04-08-2012 21 55 16 ... 28 1 0 0 0 0 0 0 0 0
1300 1971 Master Together 157733.0 1 0 04-06-2013 37 39 1 ... 0 1 1 0 0 0 0 0 0 0
1653 1977 Graduation Together 157146.0 0 0 29-04-2013 13 1 0 ... 28 0 1 0 0 0 0 0 0 0
2132 1949 PhD Married 156924.0 0 0 29-08-2013 85 2 1 ... 0 0 0 0 0 0 0 0 0 0
2233 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 14 ... 1 3 6 0 0 0 0 0 0 0

8 rows × 26 columns

In [18]:
data[data.Income > upper_whisker].describe().T
Out[18]:
count mean std min 25% 50% 75% max
Year_Birth 8.0 1972.500 10.028531 1949.0 1972.50 1975.5 1977.00 1982.0
Income 8.0 221604.500 179850.404431 153924.0 157090.50 157488.0 161201.50 666666.0
Kidhome 8.0 0.375 0.517549 0.0 0.00 0.0 1.00 1.0
Teenhome 8.0 0.250 0.462910 0.0 0.00 0.0 0.25 1.0
Recency 8.0 48.625 33.687376 13.0 22.50 34.0 82.00 98.0
MntWines 8.0 26.500 30.798887 1.0 1.75 14.5 43.00 85.0
MntFruits 8.0 4.500 6.524678 0.0 1.00 1.0 5.00 16.0
MntMeatProducts 8.0 621.875 846.511402 1.0 7.25 17.0 1592.00 1725.0
MntFishProducts 8.0 4.250 5.650537 1.0 1.00 2.0 3.50 17.0
MntSweetProducts 8.0 1.250 0.886405 0.0 1.00 1.0 1.25 3.0
MntGoldProds 8.0 3.750 4.131759 1.0 1.00 1.5 5.00 12.0
NumDealsPurchases 8.0 4.250 6.777062 0.0 0.00 0.0 6.75 15.0
NumWebPurchases 8.0 0.500 1.069045 0.0 0.00 0.0 0.25 3.0
NumCatalogPurchases 8.0 9.875 13.484780 0.0 0.00 0.5 23.50 28.0
NumStorePurchases 8.0 0.750 1.035098 0.0 0.00 0.5 1.00 3.0
NumWebVisitsMonth 8.0 1.125 2.031010 0.0 0.00 0.5 1.00 6.0
AcceptedCmp3 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp4 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp5 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp1 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
AcceptedCmp2 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
Complain 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
Response 8.0 0.000 0.000000 0.0 0.00 0.0 0.00 0.0
In [19]:
data[data.Income > upper_whisker][["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]]
Out[19]:
NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
164 15 0 22 0
617 0 0 0 1
655 0 0 0 0
687 15 0 28 1
1300 0 1 0 1
1653 0 0 28 0
2132 0 0 0 0
2233 4 3 1 3

Think About It:

  • We observed that there are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? Do we know at what percentile the upper whisker lies?
In [20]:
# Check the 99.5th percentile value for the Income variable
print(data.quantile(q=0.995)["Income"]) 
102145.75000000003
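The question raised above ("at what percentile does the upper whisker lie?") can also be answered exactly by computing the fraction of observations at or below the whisker. A self-contained sketch with toy numbers (the helper name `percentile_of` is illustrative, not from the notebook):

```python
import pandas as pd

def percentile_of(series, value):
    """Percentage of non-null observations at or below `value`."""
    s = series.dropna()
    return 100.0 * (s <= value).mean()

# Toy stand-in for the Income column (made-up values)
income = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])
q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper_whisker = q3 + 1.5 * (q3 - q1)  # 1.5 * IQR above the third quartile

print(percentile_of(income, upper_whisker))  # 4 of 5 values lie at or below it
```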

Observations and Insights:¶

  • The upper whisker lies beyond the 99.5th percentile of the Income variable, which means that all values greater than the upper whisker can be considered for dropping.
  • There are 8 customer records with Income greater than the upper whisker.
  • We need to look at other aspects of the data, especially offer acceptances and the number of purchases.
  • As seen from the data description above, none of the 8 customers has ever accepted an offer, but some have made purchases across different channels.
  • From a marketing campaign standpoint, we will keep the 3 customers who have made purchases, because they can provide meaningful insight from a purchasing standpoint. The other 5 customers can be removed from the dataset (index entries 617, 655, 1300, 2132, 2233).
In [21]:
# Dropping observations identified as outliers 
data.drop(index=[617,655,1300,2132,2233], inplace=True) # Pass the indices of the observations (separated by a comma) to drop them

Now, let's check the distribution of the Income variable after dropping outliers.

In [22]:
# Plot histogram and 'Income'
def draw_histogram(col):
    plt.figure(figsize=(15, 7))
    sns.histplot(x=col, data=data)
    plt.show()

draw_histogram("Income")
In [23]:
# Plot the histogram for 'MntWines'
draw_histogram("MntWines")
In [24]:
# Plot the histogram for 'MntFruits'
draw_histogram("MntFruits")
In [25]:
# Plot the histogram for 'MntMeatProducts' 
draw_histogram("MntMeatProducts")
In [26]:
# Plot the histogram for 'MntFishProduct'
draw_histogram("MntFishProducts")
In [27]:
# Plot the histogram for 'MntSweetProducts'
draw_histogram("MntSweetProducts")
In [28]:
# Plot the histogram for 'MntGoldProducts'
draw_histogram("MntGoldProds")

Note: Try plotting histograms for different numerical features to understand what the data looks like.¶

In [113]:
numeric_cols = ["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth",
                "Year_Birth", "Kidhome", "Teenhome", "Recency", "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                "AcceptedCmp4", "AcceptedCmp5", "Complain", "Response"]

for col in numeric_cols:
    draw_histogram(col)

Observations and Insights for all the plots:¶

  • The columns "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", and "NumWebVisitsMonth" have scattered values, which means the data is well spread out.
  • "Year_Birth" mostly has values between 1940 and 1996. The outliers on the left represent customers who would be improbably old by 2016, so they can be removed from the analysis.
  • "Kidhome" and "Teenhome" take only the values 0, 1, and 2, so they can be treated as categorical variables. The two can also be combined to represent the number of children.
  • The "Recency" data seems to have high variability.
  • The columns "AcceptedCmp1" through "AcceptedCmp5", "Complain", and "Response" take a value of either 0 or 1, so they can also be treated as categorical.

2. Univariate analysis - Categorical Data¶

Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.

In [29]:
def perc_on_bar(z):
    '''
    Plot a bar chart for the categorical column z, annotating each bar
    with its percentage of the total.
    Note: the function won't work if a column is passed to the hue parameter.
    '''

    total = len(data[z])                                          # Number of observations in the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=z, data=data, palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05                  # x position of the annotation
        y = p.get_y() + p.get_height()                            # y position of the annotation

        ax.annotate(percentage, (x, y), size = 12)                # Annotate the percentage

    plt.show()                                                    # Show the plot
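The plots summarized below come from applying this function to each column in cols. As a self-contained illustration of the same annotated-countplot idea, here is a sketch on a toy frame (made-up values, non-interactive backend so it runs in scripts):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame standing in for the notebook's `data` (made-up values)
toy = pd.DataFrame({"Education": ["Graduation"] * 5 + ["PhD"] * 3 + ["Master"] * 2})

total = len(toy["Education"])
ax = sns.countplot(x="Education", data=toy, order=toy["Education"].value_counts().index)
for p in ax.patches:
    percentage = "{:.1f}%".format(100 * p.get_height() / total)  # share of each category
    ax.annotate(percentage,
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha="center", size=12)
plt.close("all")
```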

Observations and Insights from all plots:¶

  • "Marital_Status" - The majority of customers (38.6%) in the dataset are married.
  • "Education" - The majority of customers (50.3%) are graduates; only 2.4% have Basic education.
  • "Kidhome", "Teenhome" - The majority of customers do not have children (kids or teens) at home. The distributions across the values 0, 1, and 2 are similar for both variables, so the two can be combined.
  • "Complain" - The majority of customers (99.1%) have never complained.

Bivariate Analysis¶

We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.

Correlation Heat map¶

A heat map can show a 2D correlation matrix between numerical features.

In [32]:
numeric_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Response']
plt.figure(figsize=(15, 7))                                                        # Setting the plot size
sns.heatmap(data[numeric_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations and Insights:¶

  • "Income" is correlated with most of the other variables.
  • The campaign variables - "AcceptedCmp1" through "AcceptedCmp5" and "Response" - are not strongly correlated with the other variables.
  • There is a strong correlation between the amount-spent variables ("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds") and the purchase-channel variables ("NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases").
  • Combining all amounts spent into one variable could provide more meaningful insights.
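One way to combine the amounts spent is to sum the Mnt* columns row-wise into a single feature. A sketch on toy values (`Total_Spent` is an illustrative name, not from the notebook):

```python
import pandas as pd

# Sketch: combine the amount-spent columns into one feature.
# Column names follow the data dictionary; the values are made up.
mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

toy = pd.DataFrame({c: [10, 20] for c in mnt_cols})
toy["Total_Spent"] = toy[mnt_cols].sum(axis=1)  # row-wise total across products
print(toy["Total_Spent"].tolist())
```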

The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.

Education Vs Income¶

In [41]:
sns.barplot(x="Education", y="Income", data=data)
plt.show()

Observations and Insights:¶

  • Customers with "Basic" education have the lowest income.
  • Customers with a "PhD" have the highest income.
  • Customers with "Graduation" and "Master" education have similar incomes.

Marital Status Vs Income¶

In [42]:
# Plot the bar plot for Marital_Status and Income
sns.barplot(x="Marital_Status", y="Income", data=data)
plt.show()

Observations and Insights:¶

  • Income is similar across marital statuses, although customers with "Widow" status have a slightly higher income.

Kidhome Vs Income¶

In [43]:
# Plot the bar plot for Kidhome and Income
sns.barplot(x="Kidhome", y="Income", data=data)
plt.show()

Observations and Insights:¶

  • Customers with no kids have the highest income.
  • Customers with kids have roughly similar incomes, regardless of the number of kids.

We can also visualize the relationship between two categorical variables.

Marital_Status Vs Kidhome¶

In [44]:
# Plot the bar plot for Marital_Status and Kidhome
pd.crosstab(data["Marital_Status"],data["Kidhome"]).plot(kind='bar',stacked=False)
Out[44]:
<AxesSubplot:xlabel='Marital_Status'>

Observations and Insights:¶

  • The majority of customers don't have any kids, irrespective of their marital status.
  • A considerable number of customers have 1 kid, irrespective of their marital status.
  • Very few customers have 2 kids; notably, no customers with "Widow" status have 2 kids.

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Creating new columns
  • Imputing missing values

Think About It:

  • The Year_Birth column in its current format might not be very useful for our analysis; it contains only the year of birth. Can we extract the age of each customer instead?
  • Are there other columns that can be used to create new features?

Age¶

In [45]:
# Subtracting the birth year from 2016 gives us the age of the customer at the time of data collection in 2016

data["Age"] = 2016 - pd.to_datetime(data["Year_Birth"], format="%Y").apply(lambda x: x.year) 

# Sorting the values in ascending order 
data["Age"].sort_values()                                    
Out[45]:
1170    20
46      20
1850    21
696     21
2213    21
        ..
39      73
358     73
894     73
424     75
1950    76
Name: Age, Length: 2232, dtype: int64

Observations and Insights:¶

  • In the original data, a few customers had computed ages above 115 (up to 123), which are almost certainly data anomalies.
  • After dropping those observations, customer ages range from 20 to 76.

Think About It:

  • We could observe from the above output that there are customers with an age greater than 115. Can this be true or a data anomaly? Can we drop these observations?
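
As a minimal, self-contained sketch of this kind of sanity check (on a tiny made-up frame, not the campaign data), implausible ages can be removed with a boolean mask:

```python
import pandas as pd

# Toy data: birth years, one of which is clearly a data-entry error
toy = pd.DataFrame({"Year_Birth": [1980, 1955, 1893, 1996]})
toy["Age"] = 2016 - toy["Year_Birth"]

# Keep only plausible ages (here: at most 115 years at collection time)
toy = toy[toy["Age"] <= 115].reset_index(drop=True)
```

The same pattern (or an explicit drop on the offending row labels, as this notebook does) removes the anomalous customers.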

Now, let's check the distribution of age in the data.

In [51]:
# Plot histogram to check the distribution of age
draw_histogram("Age")

Observations and Insights:¶

  • The data is spread across all ages between 20 and 76.

Kids¶

  • Let's create the feature "Kids", indicating the total number of kids and teens in the home.
In [52]:
# Add Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data["Kidhome"] + data["Teenhome"]

Family Size¶

  • Let's create a new variable called 'Family Size' to find out how many members each family has.
  • For this, we need to have a look at the Marital_Status variable, and see what are the categories.
In [53]:
# Check the unique categories in Marital_Status
data["Marital_Status"].unique()
Out[53]:
array(['Single', 'Together', 'Married', 'Divorced', 'Widow'], dtype=object)
  • We can combine the sub-categories Single, Divorced, Widow as "Single" and we can combine the sub-categories Married and Together as "Relationship"
  • Then we can create a new variable called "Status" and assign values 1 and 2 to categories Single and Relationship, respectively.
  • Then, we can use the Kids (calculated above) and the Status column to find the family size.
In [54]:
# Replace "Married" and "Together" with "Relationship"
data["Marital_Status"].replace(["Married","Together"], "Relationship", inplace = True)
In [55]:
# Replace "Divorced" and "Widow" with "Single"
data["Marital_Status"].replace(["Single","Divorced", "Widow"], "Single", inplace = True)
In [56]:
# Create a new feature called "Status" by replacing "Single" with 1 and "Relationship" with 2 in Marital_Status
data["Status"] = data["Marital_Status"].replace({"Single": 1, "Relationship": 2}) 
In [57]:
# Add two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data["Kids"] + data["Status"]

Expenses¶

  • Let's create a new feature called "Expenses", indicating the total amount spent by the customers on various products over the span of two years.
In [58]:
# Create a new feature
# Add the amount spent on each of product 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
expense_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
data["Expenses"] = data[expense_cols].sum(axis=1)

Total Purchases¶

  • Let's create a new feature called "NumTotalPurchases", indicating the total number of products purchased by the customers.
In [59]:
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
purchase_cols = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
data["NumTotalPurchases"] = data[purchase_cols].sum(axis=1)

Engaged in Days¶

  • Let's create a new feature called "Engaged in days", indicating how long the customer has been with the company.
In [60]:
# Converting Dt_customer variable to Python date time object
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"]) 

Let's check the max and min of the date.

In [61]:
# Check the minimum of the date
# Hint: Use the min() method
data["Dt_Customer"].min()
Out[61]:
Timestamp('2012-01-08 00:00:00')
In [62]:
# Check the maximum of the date
# Hint: Use the max() method
data["Dt_Customer"].max()
Out[62]:
Timestamp('2014-12-06 00:00:00')

Think About It:

  • From the above output from the max function, we observed that the last customer enrollment date is December 6th, 2014. Can we extract the number of days a customer has been with the company using some date as the threshold? Can January 1st, 2015 be that threshold?
In [63]:
 # Assigning date to the day variable
data["day"] = "01-01-2015"                         

# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)              
In [64]:
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days     

TotalAcceptedCmp¶

  • Let's create a new feature called "TotalAcceptedCmp" that shows how many offers customers have accepted.
In [65]:
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]

AmountPerPurchase¶

  • Let's create a new feature called "AmountPerPurchase" indicating the amount spent per purchase.
In [66]:
# Divide the "Expenses" by "NumTotalPurchases" to create the new feature AmountPerPurchase 
data['AmountPerPurchase'] = data["Expenses"] / data["NumTotalPurchases"]

Now, let's check the maximum value of the AmountPerPurchase.

In [67]:
# Check the max value
# Hint: Use max() function
data['AmountPerPurchase'].max()
Out[67]:
inf

Think About It:

  • Is the maximum value in the above output valid? What could be the potential reason for such output?
  • How many such values are there? Can we drop such observations?
In [68]:
# Find how many observations have NumTotalPurchases equal to 0
data[data['NumTotalPurchases'] == 0]
Out[68]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... Age Kids Status Family_Size Expenses NumTotalPurchases day Engaged_in_days TotalAcceptedCmp AmountPerPurchase
981 1965 Graduation Single 4861.0 0 0 2014-06-22 20 2 1 ... 51 0 1 1 6 0 2015-01-01 193 0 inf
1524 1973 Graduation Single 3502.0 1 0 2013-04-13 56 2 1 ... 43 1 1 2 5 0 2015-01-01 628 0 inf

2 rows × 36 columns

In [69]:
# Drop the observations with NumTotalPurchases equal to 0, using their indices
data.drop(index=[981,1524], inplace=True)

Now, let's check the distribution of values in AmountPerPurchase column.

In [70]:
# Check the summary statistics of the AmountPerPurchase variable 
data['AmountPerPurchase'].describe()
Out[70]:
count    2230.000000
mean       33.294200
std        45.015305
min         0.533333
25%         9.714286
50%        23.379808
75%        45.313963
max      1679.000000
Name: AmountPerPurchase, dtype: float64
In [71]:
# Plot the histogram for the AmountPerPurchase variable
draw_histogram("AmountPerPurchase")

Observations and Insights:¶

  • Most customers fall within a small range of amount-per-purchase values. However, the maximum value of 1679 skews the histogram above, with most of the data concentrated on the left and an almost empty area towards the right.

Imputing Missing Values¶

In [72]:
# Impute the missing values for the Income variable with the median
data['Income'].fillna(data.Income.median(), inplace = True)

Now that we are done with data preprocessing, let's visualize new features against the new income variable we have after imputing missing values.

Income Vs Expenses¶

In [73]:
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis  

plt.figure(figsize=(20, 10))                                    # Setting the plot size

sns.scatterplot(x="Income", y="Expenses", data=data)                                        # Hint: Use sns.scatterplot()  

plt.xticks(fontsize=16)                                         # Font size of X-label

plt.yticks(fontsize=16)                                         # Font size of Y-label

plt.xlabel("Income", fontsize=20, labelpad=20)                  # Title of X-axis

plt.ylabel("Expenses", fontsize=20, labelpad=20)                # Title of Y-axis
Out[73]:
Text(0, 0.5, 'Expenses')

Observations and Insights:¶

  • From the scatterplot above, we can observe a strong positive correlation between Income and Expenses.
  • We also see outliers on the far right side of the plot which need to be further analyzed.

Family Size Vs Income¶

In [74]:
# Plot the bar plot for Family Size on X-axis and Income on Y-axis
print(sns.barplot(x="Family_Size", y="Income", data=data))
AxesSubplot(0.125,0.11;0.775x0.77)

Observations and Insights:¶

  • Customers with smaller family sizes have slightly higher incomes.
  • As seen from the error bars on each bar, income for families of size 5 has the most variability, while all other family sizes show comparatively less variability.

Proposed approach¶

  • Potential techniques - What different techniques should be explored?

We will explore the following unsupervised learning techniques to identify the best fit for the given dataset:

A. Dimensionality Reduction

We will first try to reduce the dimensions of the given dataset using the following techniques:

  1. PCA (Principal Component Analysis)
  2. t-SNE (t-distribution Stochastic Neighbor Embedding)

B. Clustering

After reducing the dimensionality of the dataset, we will use the following different methods to create clusters.

  1. K-means clustering
  2. K-Medoids Clustering
  3. Gaussian mixture models
  4. DBSCAN
  • Overall solution design - What is the potential solution design?

Dimensionality Reduction We will use unsupervised learning techniques to reduce the data to a lower dimension, so that the insights found during exploration and analysis become easier to understand and interpret. We will focus on two major techniques to reduce the dimensions of the data:

  1. PCA (Principal Component Analysis)
  2. t-SNE (t-distribution Stochastic Neighbor Embedding)

PCA It is a linear dimensionality reduction technique in unsupervised learning that is used to reduce the dimensionality of the given data. PCA ensures that most of the variance is preserved while reducing the dimensions, so that the data with reduced dimensions will still give equally trustworthy inferences and insights during any analytical process.

The correlation matrix (rather than the covariance matrix) is generally used for PCA when the features have different units of measurement. To nullify the effect of scale, normalization needs to be done: we will scale the data so that different units of measurement do not distort the variance of the data.
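
As an illustrative sketch on synthetic data (all names, sizes, and the number of components are arbitrary), scaling followed by PCA preserves most of the variance even after dropping dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 4] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # a nearly redundant feature

# Standardize, then keep only 3 of the 5 dimensions
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3, random_state=1)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by the first 3 components
retained = pca.explained_variance_ratio_.sum()
```

Because one feature is almost a copy of another, three components capture most of the variance here.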

t-SNE SNE (Stochastic Neighbor Embedding) is a probabilistic approach to transforming objects from a high dimension to a low dimension; in terms of approach, this method is totally different from principal component analysis. It fits a Gaussian distribution around each individual data point in the given dataset, and in the lower dimension it creates another distribution in such a way that the two distributions look similar to each other. Two distributions can be considered to be similar if their corresponding dimensions are comparable to each other.

The t-SNE method differs from standard SNE in the distribution used to replicate the data from higher to lower dimensions: instead of a Gaussian distribution, a t-distribution is placed on top of every data point in the low-dimensional space. t-SNE is a non-linear dimensionality reduction technique.
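
A minimal t-SNE sketch on synthetic data (the sizes and perplexity are illustrative; perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated blobs in 10 dimensions
X = np.vstack([rng.normal(0, 1, size=(40, 10)),
               rng.normal(8, 1, size=(40, 10))])

# Non-linear projection down to 2 dimensions for visualization
X_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
```

The resulting 2-D embedding can then be scatter-plotted, as done with the campaign data later in this notebook.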

Clustering After reducing the dimensionality of the dataset, we will use the following different methods to create clusters.

  1. K-means clustering
  2. K-Medoids Clustering
  3. Gaussian mixture models
  4. DBSCAN

K-means clustering In K-means Clustering, we create K clusters from the given data. We start by deciding on a fixed number of clusters we expect to find in the data, denoted by K. Data visualization techniques like PCA and t-SNE, and other methods like elbow plots can help give us an idea about a suitable value for K.
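
A minimal K-means sketch on synthetic blobs (all names and parameters are illustrative, not the notebook's final model):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three tight synthetic blobs in 2-D
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 5, 10)])

# K is fixed in advance; here we know the data has 3 groups
km = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X)
labels = km.labels_            # cluster assignment per point
centers = km.cluster_centers_  # one centroid per cluster
```

With real data, K is not known up front, which is why elbow plots and silhouette scores are used below.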

K-Medoids Clustering Because K-means uses the mean to compute centroids in each iteration, outliers can strongly influence the final clusters. K-Medoids clustering instead uses medoids (actual data points that minimize the total distance to the other members of their cluster) in each iteration and hence reduces the impact of outliers.
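
The medoid idea can be sketched in plain NumPy (a toy illustration, not the sklearn_extra implementation used later): the cluster center must be an actual data point, so a single extreme outlier cannot drag it away the way it drags the mean.

```python
import numpy as np

def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return points[dists.sum(axis=1).argmin()]

# A tight cluster plus one extreme outlier
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])

center_mean = pts.mean(axis=0)   # dragged far away by the outlier
center_medoid = medoid(pts)      # stays on a typical point
```

The mean lands at roughly (25, 25), far from every actual customer, while the medoid remains inside the tight cluster.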

Gaussian Mixture Models GMM is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Hence, the distribution of samples within each cluster is modeled by a Gaussian. The parameters of these Gaussians (means, covariances, and mixture weights) are typically estimated with the Expectation-Maximization (EM) algorithm.
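
A minimal GMM sketch on synthetic blobs (illustrative parameters); unlike K-means, it yields soft, probabilistic cluster assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two Gaussian blobs with different spreads
X = np.vstack([rng.normal(0, 1.0, size=(60, 2)),
               rng.normal(6, 0.5, size=(60, 2))])

gmm = GaussianMixture(n_components=2, random_state=3).fit(X)
labels = gmm.predict(X)        # hard assignment per point
probs = gmm.predict_proba(X)   # soft assignment: per-point cluster probabilities
```

Each row of `probs` sums to 1, which is useful when cluster membership is ambiguous.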

DBSCAN DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that lie in dense regions into clusters and treats points in sparse regions as noise. One of its major benefits is that it handles the outliers existing in the data all by itself, and it does not require the number of clusters to be specified in advance.
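
A minimal DBSCAN sketch (the `eps` and `min_samples` values are illustrative): dense blobs become clusters, and the isolated point is labeled -1, DBSCAN's marker for noise/outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
# Two dense blobs plus one far-away point
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)),
               rng.normal(5, 0.1, size=(30, 2)),
               [[50.0, 50.0]]])

# Points with at least min_samples neighbors within eps form clusters;
# points that fit nowhere get the noise label -1
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
```

No number of clusters was specified; DBSCAN found two and flagged the outlier on its own.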

  • Measures of success - What are the key measures of success?

Quality of Clustering - Silhouette Plot We will use the Silhouette plot to measure the quality of clusters. The Silhouette plot displays the Silhouette score for each cluster, which ranges from -1 to 1. A value of 1 means clusters are well apart from each other and clearly distinguished. A value of 0 means clusters are indifferent, or we can say that the distance between clusters is not significant. A value of -1 means samples have been assigned to the wrong clusters. Generally, a value of 0.5 is used as a cut-off to say that the data points are well clustered.
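
A minimal sketch of the score itself (synthetic, well-separated blobs should score close to +1; all parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
# Two tight, well-separated blobs
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(10, 0.3, size=(40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X)
score = silhouette_score(X, labels)  # close to 1 for clean, separated clusters
```

On real, overlapping data the score is typically much lower, as seen with the campaign data in Milestone 2.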

In [ ]:
 

Customer Segmentation¶

Milestone 2¶

Note: This is in continuation to the data preprocessing we did in Milestone 1. Results might differ if you have followed different steps in data preprocessing.

-------------------------------- Begin - Getting Milestone 1 Dataset --------------------------------------¶

In [2]:
!pip install yellowbrick --user
Requirement already satisfied: yellowbrick in c:\users\ashis\appdata\roaming\python\python39\site-packages (1.4)
Note: you may need to restart the kernel to use updated packages.
In [75]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Importing PCA
from sklearn.decomposition import PCA

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To compute distances
from scipy.spatial.distance import pdist

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")
In [76]:
data = pd.read_csv("marketing_campaign.csv")
In [77]:
data.drop(columns = ["ID"], inplace= True)
In [78]:
data["Education"].replace("2n Cycle", "Master", inplace = True)  # Hint: Use the replace() method and inplace=True
In [79]:
data["Marital_Status"].replace(["Alone","Absurd", "YOLO"], "Single", inplace = True) # Hint: Use the replace() method and inplace=True
In [80]:
Q1 = data.quantile(q=0.25)                          # Finding the first quartile
Q3 = data.quantile(q=0.75)                          # Finding the third quartile
IQR = Q3 - Q1                                       # Finding the Inter Quartile Range
upper_whisker = (Q3 + 1.5 * IQR)["Income"]          # Calculating the Upper Whisker for the Income variable
data[data.Income > upper_whisker][["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]]
data.drop(index=[617,655,1300,2132,2233], inplace=True) # Pass the indices of the observations (separated by a comma) to drop them
In [81]:
def draw_histogram(col):
    plt.figure(figsize=(15, 7))
    sns.histplot(x=col, data=data)
    plt.show()

draw_histogram("Income")
In [82]:
data["Age"] = 2016 - pd.to_datetime(data["Year_Birth"], format="%Y").apply(lambda x: x.year) 

# Sorting the values in ascending order 
data["Age"].sort_values()  
Out[82]:
1170     20
46       20
747      21
2213     21
995      21
       ... 
424      75
1950     76
192     116
339     117
239     123
Name: Age, Length: 2235, dtype: int64
In [83]:
data.drop(index = [192,339,239], inplace=True)
In [84]:
data["Kids"] = data["Kidhome"] + data["Teenhome"]
In [85]:
data["Marital_Status"].replace(["Married","Together"], "Relationship", inplace = True)
data["Marital_Status"].replace(["Single","Divorced", "Widow"], "Single", inplace = True)
data["Status"] = data["Marital_Status"].replace({"Single": 1, "Relationship": 2}) 
data["Family_Size"] = data["Kids"] + data["Status"]

expense_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
data["Expenses"] = data[expense_cols].sum(axis=1)

purchase_cols = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
data["NumTotalPurchases"] = data[purchase_cols].sum(axis=1)

data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"]) 

 # Assigning date to the day variable
data["day"] = "01-01-2015"                         

# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)    

data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days    

data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]

data['AmountPerPurchase'] = data["Expenses"] / data["NumTotalPurchases"]
In [86]:
data.drop(index=[981,1524], inplace=True)
In [87]:
data['Income'].fillna(data.Income.median(), inplace = True)
In [88]:
data.shape
Out[88]:
(2230, 36)

-------------------------------- End - Getting Milestone 1 Dataset --------------------------------------¶

Preparing Data for Segmentation¶

Dropping columns that we will not use for segmentation¶

The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.

Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.

Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.

  • Dt_Customer: We have created the Engaged_in_days variable using the Dt_Customer variable. Hence, we can drop this variable as it will not help with segmentation.
  • Complain: About 95% of the customers didn't complain and have the same value for this column. This variable will not have a major impact on segmentation. Hence, we can drop this variable.
  • day: We have created the Engaged_in_days variable using the 'day' variable. Hence, we can drop this variable as it will not help with segmentation.
  • Status: This column was created just to get the Family_Size variable that contains the information about the Status. Hence, we can drop this variable.
  • We also need to drop categorical variables like Education and Marital_Status, as well as Kids, Kidhome, and Teenhome, since distance-based algorithms cannot use a default distance metric like Euclidean across a mix of categorical and numerical variables.
  • We can also drop the binary campaign variables AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response, for which we have created the aggregate variable TotalAcceptedCmp.
In [89]:
# Dropping all the irrelevant columns and storing in data_model
data_model = data.drop(
    columns=[
        "Year_Birth",
        "Dt_Customer",
        "day",
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "Status",
        "Kids",
        'Education',
        'Kidhome',
        'Teenhome', 'Income','Age', 'Family_Size'
    ],
    axis=1,
)
In [90]:
# Check the shape of new data 
data_model.shape
Out[90]:
(2230, 17)
In [91]:
# Check first five rows of new data
data_model.head()
Out[91]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 58 635 88 546 172 88 88 3 8 10 4 7 1617 25 997 1 64.680000
1 38 11 1 6 2 1 6 2 1 1 2 5 27 6 151 0 4.500000
2 26 426 49 127 111 21 42 1 8 2 10 4 776 21 498 0 36.952381
3 26 11 4 20 10 3 5 2 2 0 4 6 53 8 91 0 6.625000
4 94 173 43 118 46 27 15 5 5 3 6 5 422 19 347 0 22.210526

Let's plot the correlation plot after we've removed the irrelevant variables.

In [92]:
# Plot the correlation plot for new data
plt.figure(figsize=(15, 7))                                                        # Setting the plot size
sns.heatmap(data_model.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations and Insights:

  • Recency and Engaged_in_days have a low correlation with all variables.
  • Expenses has a low correlation with NumDealsPurchases and Engaged_in_days, and a positive correlation with almost all other variables; it is strongly correlated with MntWines, MntMeatProducts, NumCatalogPurchases, NumStorePurchases, and NumTotalPurchases.
  • NumWebVisitsMonth has a negative correlation with most of the variables.
  • NumDealsPurchases does not have a strong correlation with most of the variables. However, its positive correlation with NumWebVisitsMonth and NumWebPurchases might mean that most of the deals are released on the web channel.
  • MntMeatProducts has a high correlation with NumCatalogPurchases.

Scaling the Data¶

What is feature scaling?

Feature scaling is a class of statistical techniques that, as the name implies, scales the features of our data so that they all have a similar range. You'll understand better if we look at an example:

If you have multiple independent variables like Age, Income, and Amount related variables, with their range as (18–100 Years), (25K–75K), and (100–200), respectively, feature scaling would help them all to be in the same range.

Why is feature scaling important in Unsupervised Learning?

Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.

So, scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.
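
A toy illustration of the dominance problem (made-up numbers, not the campaign data): the raw Income spread is orders of magnitude larger than the raw Age spread, so Euclidean distances are driven almost entirely by Income; after z-scoring, both features are on the same footing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customers: [Age in years, Income in dollars]
X = np.array([[25.0, 30000.0],
              [60.0, 31000.0],
              [26.0, 70000.0]])

# Raw spreads differ by orders of magnitude, so distances are dominated by Income
raw_std = X.std(axis=0)

# After z-scoring, every feature has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
```

After scaling, a 35-year age gap and a large income gap contribute on comparable scales to the distance computation.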

Let's scale the data

Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.

SC.png

  1. Data standardization is the process of rescaling the attributes so that they have a mean of 0 and a variance of 1.
  2. The ultimate goal to perform standardization is to bring down all the features to a common scale without distorting the differences in the range of the values.
  3. In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.
In [93]:
# Applying standard scaler on new data
scaler = StandardScaler()                                                   # Initialize the Standard Scaler

df_scaled = scaler.fit_transform(data_model)                                        # fit_transform the scaler function on new data

df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns)      # Converting the embeddings to a dataframe

df_scaled.head()
Out[93]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 0.306982 0.980594 1.551219 1.676726 2.457442 1.472507 0.843113 0.345710 1.405761 2.506784 -0.558344 0.698098 1.677422 1.318790 1.974283 0.620406 0.697381
1 -0.383595 -0.872986 -0.637634 -0.714923 -0.651718 -0.632873 -0.731545 -0.172043 -1.116240 -0.571292 -1.175192 -0.132838 -0.964662 -1.164859 -1.667562 -0.503481 -0.639797
2 -0.797941 0.359764 0.570009 -0.179016 1.341802 -0.148877 -0.040232 -0.689796 1.405761 -0.229284 1.292199 -0.548306 0.279942 0.795916 -0.173803 -0.503481 0.081283
3 -0.797941 -0.872986 -0.562157 -0.652917 -0.505404 -0.584473 -0.750748 -0.172043 -0.755954 -0.913301 -0.558344 0.282630 -0.921458 -0.903422 -1.925849 -0.503481 -0.592580
4 1.550020 -0.391768 0.419054 -0.218877 0.153006 -0.003679 -0.558717 1.381217 0.324904 0.112725 0.058504 -0.132838 -0.308295 0.534480 -0.823825 -0.503481 -0.246275

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [94]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed

tsne = TSNE(n_components = 2, random_state = 1, perplexity = 35)       # Initializing T-SNE with number of components equal to 2, random_state=1, and perplexity=35

data_tsne = tsne.fit_transform(df_scaled)                              # fit_transform T-SNE on the scaled data

data_tsne = pd.DataFrame(data_tsne, columns=[0, 1])                    # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))                                             # Setting the plot size

sns.scatterplot(x=0, y=1, data=data_tsne)                              # Scatter plot of the two components
Out[94]:
<AxesSubplot:xlabel='0', ylabel='1'>

Observation and Insights:

  • The t-SNE projection above does not show clear groups separated from each other.
  • We will also reduce the data with PCA and analyze it further before clustering.

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
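
This decorrelation can be verified directly: by construction, the principal components of a dataset are mutually uncorrelated (illustrative sketch on synthetic correlated features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
a = rng.normal(size=300)
# Feature 1 is nearly a multiple of feature 0; feature 2 is independent noise
X = np.column_stack([a,
                     2 * a + rng.normal(scale=0.2, size=300),
                     rng.normal(size=300)])

corr_before = np.corrcoef(X, rowvar=False)   # features 0 and 1 highly correlated

X_pca = PCA(n_components=3, random_state=6).fit_transform(X)
corr_after = np.corrcoef(X_pca, rowvar=False)  # off-diagonals are ~0
```

Clustering on the PCA scores therefore avoids the multicollinearity present in the original feature space.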

In [95]:
# Defining the number of principal components to generate
n = data_model.shape[1]                                        # Storing the number of variables in the data

pca = PCA(n_components = n, random_state = 1)                                        # Initialize PCA with n_components = n and random_state=1

data_pca = pd.DataFrame(pca.fit_transform(df_scaled))                      # fit_transform PCA on the scaled data

# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_   

Let's plot the first two components and see how the data points are distributed.

In [96]:
# Scatter plot for two components using the dataframe data_pca
plt.figure(figsize=(7, 7))  

sns.scatterplot(x=0, y=1, data=data_pca)
Out[96]:
<AxesSubplot:xlabel='0', ylabel='1'>

Let's apply clustering algorithms on the data generated after applying PCA

K-Means¶

In [97]:
distortions = []                                                  # Create an empty list

K = range(2, 10)                                                  # Setting the K range from 2 to 10

for k in K:
    kmeanModel = KMeans(n_clusters=k,random_state=4)              # Initialize K-Means
    kmeanModel.fit(data_pca)                                      # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_)                       # Append distortion values to the empty list created above
In [98]:
# Plotting the elbow plot
plt.figure(figsize=(16, 8))                                            # Setting the plot size

plt.plot(K, distortions, "bx-")                                        # Plotting the K on X-axis and distortions on y-axis

plt.xlabel("k")                                                        # Title of x-axis

plt.ylabel("Distortion")                                               # Title of y-axis

plt.title("The Elbow Method showing the optimal k")                    # Title of the plot
plt.show()

In the above plot, elbows are seen at K=3 and K=5, where the drop in distortion slows noticeably.

Think About It:

  • How do we determine the optimal K value when the elbows are observed at 2 or more K values from the elbow curve?
  • Which metric can be used to determine the final K value?

We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters (K).

What is the silhouette score?¶

The silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score has a range of [-1, 1].

  • Silhouette coefficients near +1 indicate that the clusters are dense and well separated, which is good.
  • Silhouette score near -1 indicates that those samples might have been assigned to the wrong cluster.
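For each sample, the score is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; the dataset score is the mean of s over all samples. A minimal sanity check on toy data (the blobs and labels below are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs: a is small, b is large,
# so s = (b - a) / max(a, b) should be close to +1
pts = np.vstack([rng.normal(0, 0.1, (20, 2)),
                 rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

good = silhouette_score(pts, labels)

# Alternating labels mix the two blobs, pushing the score toward 0
mixed = silhouette_score(pts, np.array([0, 1] * 20))

print(round(good, 3), round(mixed, 3))
```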

Finding silhouette score for each value of K

In [99]:
sil_score = []                                                             # Creating empty list
cluster_list = range(3, 7)                                                 # Candidate K values from 3 to 6
for n_clusters in cluster_list:
    
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters = n_clusters, random_state = 1)
    
    # Fit and predict on the pca data
    preds = clusterer.fit(data_pca).predict(data_pca) 
    print(preds)
    # Calculate silhouette score - Hint: Use silhouette_score() function
    score = silhouette_score(data_pca, preds)  
    
    # Append silhouette score to empty list created above
    sil_score.append(score)
    
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))  
[2 0 1 ... 1 1 0]
For n_clusters = 3, the silhouette score is 0.26912817852239623
[2 1 0 ... 0 0 1]
For n_clusters = 4, the silhouette score is 0.2714184940821574
[0 1 2 ... 2 2 1]
For n_clusters = 5, the silhouette score is 0.23502481034162284
[4 0 1 ... 1 1 0]
For n_clusters = 6, the silhouette score is 0.2367264667444932

From the above silhouette scores, K=3 and K=4 score almost the same and higher than K=5 and K=6. So, let's build K-Means using K=3.

Applying K-Means on data_pca¶

In [100]:
kmeans = KMeans(n_clusters = 3, random_state = 1)                                # Initialize the K-Means algorithm with 3 clusters and random_state=1

kmeans.fit(data_pca)                                     # Fitting on the data_pca
Out[100]:
KMeans(n_clusters=3, random_state=1)
In [101]:
data_pca["K_means_segments_3"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data["K_means_segments_3"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_3"] = kmeans.labels_                  # Adding K-Means cluster labels to data_model
In [102]:
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
Out[102]:
0    1058
1     606
2     566
Name: K_means_segments_3, dtype: int64

Let's visualize the clusters using PCA

In [103]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [104]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")

Observations and Insights:

  • The scatter plot of data_pca clearly shows 3 different groups.
  • However, approx. 47% of the data falls in the first cluster. We need to analyze further and see if the data can be distributed more evenly across clusters.

Cluster Profiling¶

In [105]:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby('K_means_segments_3').mean()
In [106]:
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
Out[106]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_3                                                                
0 1970.977316 35352.136106 0.756144 0.476371 49.297732 45.251418 5.083176 23.678639 7.349716 5.137996 15.243856 2.029301 2.142722 0.573724 3.293951 6.375236 0.068053 0.015123 0.000000 0.000945 0.001890 0.010397 0.084121 45.022684 1.232514 1.653119 2.885633 101.744802 8.039698 499.818526 0.170132 11.313861
1 1965.942244 57587.478548 0.278878 0.839934 47.528053 453.122112 23.174917 138.214521 30.630363 24.759076 62.414191 3.778878 6.498350 3.103960 7.886139 5.755776 0.070957 0.123762 0.021452 0.036304 0.014851 0.008251 0.133663 50.057756 1.118812 1.655116 2.773927 732.315182 21.267327 594.488449 0.400990 34.272181
2 1968.183746 76370.727915 0.037102 0.210247 50.450530 631.501767 69.478799 467.376325 101.742049 70.863958 78.411661 1.349823 5.183746 6.125442 8.291519 2.879859 0.084806 0.134276 0.263251 0.213781 0.031802 0.007067 0.289753 47.816254 0.247350 1.620141 1.867491 1419.374558 20.950530 550.365724 1.017668 73.334029

Observations and Insights:

  • It looks like Cluster 2 consists of customers with high income and the maximum purchases.
  • Cluster 1 consists of customers with medium income and the maximum web and deal purchases.
  • Cluster 0 consists of customers with the lowest income and the maximum web visits in a month.

Let us create a boxplot for each of the variables

In [107]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [108]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

sns.set(font_scale=2)

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster:¶

Cluster 0: belongs to the low income group with the most children (kids and teens). Summary for cluster 0:

  • This cluster has the least acceptance of offers.
  • This cluster has the least number of purchases.
  • This cluster has the maximum web visits per month.

Cluster 1: belongs to the medium income group with mostly kids at home. Summary for cluster 1:

  • This cluster has the maximum purchases from deals and from the web.
  • This cluster has spent a considerable amount on Gold products.
  • This cluster has accepted a moderate number of offers.

Cluster 2: belongs to the high income group with almost no children (kids and teens) at home. Summary for cluster 2:

  • This cluster has the maximum amount spent across all categories.
  • This cluster has accepted the maximum number of offers.
  • This cluster prefers catalogue purchases.

Think About It:

  • Are the K-Means profiles with K=3 providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

We can see from the above profiles that K=3 segments the customers into high-, medium-, and low-income groups, and we are not getting deep insights into different types of customers. So, let's try K=5 (which has another elbow in the Elbow curve) and see if we can get better cluster profiles.

In [109]:
# Dropping labels we got from K=3 since we will be using PCA data for prediction
# Drop K_means_segments_3. Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_3", axis=1, inplace=True)
data.drop("K_means_segments_3", axis=1, inplace=True)

Let's build K-Means using K=5

In [110]:
# Fit the K-Means algorithm using number of cluster as 5 and random_state=0 on data_pca
kmeans = KMeans(n_clusters = 5, random_state = 0)  

kmeans.fit(data_pca)                               
Out[110]:
KMeans(n_clusters=5, random_state=0)
In [111]:
# Add K-Means cluster labels to data_pca

# Add K-Means cluster labels to whole data

# Add K-Means cluster labels to data_model
data_pca["K_means_segments_5"] = kmeans.labels_ 

data["K_means_segments_5"] = kmeans.labels_     

data_model["K_means_segments_5"] = kmeans.labels_ 
In [112]:
# Let's check the distribution
data_pca["K_means_segments_5"].value_counts()
Out[112]:
0    1001
2     530
1     451
4     247
3       1
Name: K_means_segments_5, dtype: int64

Let's visualize the clusters using PCA

In [113]:
sns.set(font_scale=1)
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")

Cluster Profiling¶

In [114]:
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby('K_means_segments_5').mean()
In [115]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
Out[115]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_5                                                                
0 1971.301698 34762.484515 0.758242 0.458541 49.274725 38.295704 4.849151 21.219780 7.055944 5.070929 14.408591 1.900100 2.004995 0.527473 3.210789 6.342657 0.062937 0.012987 0.000000 0.000999 0.001998 0.010989 0.081918 44.698302 1.216783 1.655345 2.872128 90.900100 7.643357 493.825175 0.160839 10.818857
1 1967.403548 70815.707317 0.062084 0.370288 50.982262 457.130820 67.075388 349.798226 96.077605 69.250554 80.465632 1.705100 5.392461 5.232816 8.931264 3.033259 0.019956 0.019956 0.035477 0.050998 0.000000 0.008869 0.064302 48.596452 0.432373 1.647450 2.079823 1119.798226 21.261641 527.263858 0.190687 54.738124
2 1965.545283 54538.571698 0.362264 0.879245 47.894340 417.652830 15.760377 112.283019 21.528302 15.745283 55.769811 4.152830 6.290566 2.675472 7.143396 6.258491 0.086792 0.126415 0.011321 0.028302 0.011321 0.007547 0.141509 50.454717 1.241509 1.647170 2.888679 638.739623 20.262264 610.262264 0.405660 31.003787
3 1978.000000 51369.000000 0.000000 0.000000 53.000000 32.000000 2.000000 1607.000000 12.000000 4.000000 22.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38.000000 0.000000 2.000000 2.000000 1679.000000 1.000000 754.000000 1.000000 1679.000000
4 1969.060729 80301.461538 0.044534 0.161943 47.611336 866.437247 61.890688 539.384615 89.506073 64.340081 73.032389 1.331984 5.530364 6.676113 7.805668 3.356275 0.182186 0.311741 0.566802 0.425101 0.085020 0.004049 0.599190 46.939271 0.206478 1.595142 1.801619 1694.591093 21.344130 584.076923 2.170040 83.475670

Let's plot the boxplot

In [116]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

sns.set(font_scale=2)

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
sns.set(font_scale=1)

Characteristics of each cluster¶

Cluster 0: Summary for cluster 0:

  • Cluster 0 belongs to the low income group with the most children (kids and teens) at home.
  • This cluster has the least number of purchases.
  • This cluster has the most web visits per month.

Cluster 1: Summary for cluster 1:

  • This cluster represents the upper-medium income group.
  • This cluster mostly has teens at home.
  • This cluster has the maximum spend on Fruit, Fish, Sweet, and Gold products.

Cluster 2: Summary for cluster 2:

  • This cluster also represents an upper-medium income group.
  • This cluster mostly spends on Wines and Gold products.
  • This cluster prefers deals.
  • This cluster has accepted a relatively high number of offers.

Cluster 3: Summary for cluster 3:

  • This cluster has only 1 data point.
  • Not much can be inferred from a single data point.

Cluster 4: Summary for cluster 4:

  • This cluster represents the high-income group.
  • This cluster mostly does not have children (kids + teens) at home.
  • This cluster makes the maximum purchases across all categories and through all channels.
  • This cluster has accepted the most offers.
In [117]:
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_5", axis=1, inplace=True)
data.drop("K_means_segments_5", axis=1, inplace=True)

From the above profiles, K=5 provides more interesting insights about customers' purchasing behavior and preferred channels for purchasing products. We can also see that the high, medium, and low income groups have different age groups and preferences, which was not evident with K=3. So, we can choose K=5.

K-Medoids¶

Let's find the silhouette score for K=5 in K-Medoids

In [118]:
kmedo = KMedoids(n_clusters = 5, random_state = 1)           # Initializing K-Medoids with number of clusters as 5 and random_state=1

preds = kmedo.fit(data_pca).predict(data_pca)            # Fit and predict K-Medoids using data_pca

score = silhouette_score(data_pca, preds)           # Calculate the silhouette score

print(score)                   # Print the score
0.11225735300731468

Observations and Insights:

  • A positive value of the silhouette score means that the clusters are separated.
  • However, a value ~ 0.11 also implies that the clusters might be overlapping.
In [119]:
# Adding K-Medoids cluster labels (predicted on data_pca) to the whole data
data['kmedoLabels'] = preds

# Adding K-Medoids cluster labels to data_model
data_model['kmedoLabels'] = preds

# Adding K-Medoids cluster labels to data_pca
data_pca['kmedoLabels'] = preds
In [120]:
# Let's check the distribution
data.kmedoLabels.value_counts()
Out[120]:
0    684
3    639
2    304
1    303
4    300
Name: kmedoLabels, dtype: int64

Let's visualize the clusters using PCA

In [121]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "kmedoLabels")

Cluster Profiling¶

In [122]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
kmedo_clustering_segment = data.groupby('kmedoLabels').mean()
In [123]:
# Highlight the maximum average value among all the clusters for each of the variables
kmedo_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
Out[123]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
kmedoLabels                                                                
0 1972.519006 32669.786550 0.807018 0.403509 44.461988 21.457602 3.328947 13.428363 4.995614 3.428363 10.295322 1.535088 1.606725 0.290936 2.861111 6.494152 0.081871 0.007310 0.000000 0.000000 0.002924 0.011696 0.076023 43.480994 1.210526 1.652047 2.862573 56.934211 6.293860 470.497076 0.168129 8.523335
1 1969.379538 39736.607261 0.742574 0.673267 45.818482 178.221122 8.557756 63.514851 14.254125 8.828383 43.825083 4.504950 4.442244 1.389439 4.551155 7.729373 0.062706 0.046205 0.000000 0.003300 0.003300 0.016502 0.204620 46.620462 1.415842 1.617162 3.033003 317.201320 14.887789 755.082508 0.320132 20.427279
2 1966.430921 65041.937500 0.131579 0.644737 30.674342 532.453947 26.582237 179.536184 36.779605 31.677632 45.069079 2.569079 6.052632 3.779605 9.069079 4.601974 0.092105 0.259868 0.138158 0.128289 0.049342 0.009868 0.230263 49.569079 0.776316 1.628289 2.404605 852.098684 21.470395 474.516447 0.898026 40.334456
3 1967.508607 72751.917058 0.070423 0.348983 55.028169 618.676056 66.477308 425.378717 94.489828 66.899844 86.555556 1.857590 5.807512 5.758998 8.239437 3.521127 0.078247 0.084507 0.186228 0.156495 0.017214 0.004695 0.230047 48.491393 0.419405 1.624413 2.043818 1358.477308 21.663537 589.438185 0.752739 68.248298
4 1965.630000 49596.180000 0.426667 0.776667 69.103333 180.066667 11.056667 61.730000 15.426667 10.500000 30.003333 2.726667 3.810000 1.686667 5.330000 4.766667 0.033333 0.050000 0.003333 0.013333 0.000000 0.003333 0.010000 50.370000 1.203333 1.720000 2.923333 308.783333 13.553333 430.203333 0.110000 21.181011

Let's plot the boxplot

In [124]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

sns.set(font_scale=2)

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['kmedoLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
sns.set(font_scale=1)

Characteristics of each cluster¶

Cluster 0: Summary for cluster 0:

  • This cluster represents the lower income group.
  • This cluster spends the least across all categories.
  • This cluster has among the most web visits per month.

Cluster 1: Summary for cluster 1:

  • This cluster represents the lower-medium income group with the most kids at home.
  • This cluster prefers deals the most.
  • This cluster has among the most web visits per month.

Cluster 2: Summary for cluster 2:

  • This cluster represents a higher income group.
  • This cluster mostly spends on Wines and Fruits.
  • This cluster has among the highest number of store purchases.

Cluster 3: Summary for cluster 3:

  • This cluster represents the highest income group, with few children at home.
  • This cluster spends the maximum across all categories.
  • This cluster prefers all channels.

Cluster 4: Summary for cluster 4:

  • This cluster represents a medium income group.
  • This cluster spends across all categories but is not the maximum spender.
  • This cluster prefers all channels.
In [125]:
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("kmedoLabels", axis=1, inplace=True)
data.drop("kmedoLabels", axis=1, inplace=True)

Hierarchical Clustering¶

Let's find the Cophenetic correlation for different distances with different linkage methods.

What is a Cophenetic correlation?¶

The cophenetic correlation coefficient is a correlation coefficient between the cophenetic distances (dendrogrammatic distances) obtained from the tree and the original distances used to construct the tree. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

Cophenetic correlation is also a way to compare two or more dendrograms.
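As a small illustration of the idea, the sketch below builds a linkage on toy data and correlates the cophenetic distances with the original pairwise distances. Note that pdist should use the same metric as the linkage so the two sets of distances are comparable (the toy data here is an assumption for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two compact groups: the dendrogram should preserve pairwise distances well
pts = np.vstack([rng.normal(0, 0.2, (10, 2)),
                 rng.normal(4, 0.2, (10, 2))])

dists = pdist(pts, metric="cityblock")        # original pairwise distances
Z = linkage(pts, metric="cityblock", method="average")
c, coph_dists = cophenet(Z, dists)            # correlation between the two sets

print(round(c, 3))                            # closer to 1 = better preserved
```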

Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods

In [126]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0                                                 # Tracks the highest cophenetic correlation seen so far
high_dm_lm = [0, 0]                                                    # Stores the [distance metric, linkage method] pair with the highest correlation

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)                    # Applying different linkages with different distance on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca))                   # Calculating cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:                                     # Checking if this cophenetic correlation is higher than the best so far
            high_cophenet_corr = c                                     # Updating the highest correlation
            high_dm_lm[0] = dm                                         # Storing its corresponding distance metric
            high_dm_lm[1] = lm                                         # Storing its corresponding linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8144235363028587.
Cophenetic correlation for Euclidean distance and complete linkage is 0.5382569223048559.
Cophenetic correlation for Euclidean distance and average linkage is 0.851319325915757.
Cophenetic correlation for Chebyshev distance and single linkage is 0.8247390010547467.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6146660207500972.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8045096030731448.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6573005880829239.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.4716566676805196.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.67507197359385.
Cophenetic correlation for Cityblock distance and single linkage is 0.808691020533747.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8294847305072225.
Cophenetic correlation for Cityblock distance and average linkage is 0.8613796016560248.
In [127]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8613796016560248, which is obtained with Cityblock distance and average linkage.

Let's have a look at the dendrograms for different linkages with Cityblock distance

In [128]:
#from scipy.cluster.hierarchy import dendrogram, linkage
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Cityblock", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

  • We can observe that the complete linkage gives better separated clusters.
  • A threshold value of 40 will give us 4 clusters.
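Reading a cluster count off a dendrogram corresponds to cutting it at a height t, which scipy exposes via fcluster with criterion="distance". A minimal sketch on toy data (the blobs and the threshold t=1.5 are illustrative assumptions, not values from this notebook):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three tight blobs; cutting the dendrogram below the inter-blob merge
# heights should recover exactly three clusters
pts = np.vstack([rng.normal((0, 0), 0.05, (15, 2)),
                 rng.normal((3, 0), 0.05, (15, 2)),
                 rng.normal((0, 3), 0.05, (15, 2))])

Z = linkage(pts, metric="cityblock", method="complete")
labels = fcluster(Z, t=1.5, criterion="distance")   # cut at height 1.5

print(len(set(labels)))                             # number of clusters at this threshold
```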

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining the number of clusters based on the dendrogram?

Let's have a look at the dendrograms for different linkages with Chebyshev distance

In [129]:
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average. 
# Hint: Use Chebyshev distance as the metric in the linkage() function 

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Chebyshev", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Chebyshev distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

  • As with the Cityblock distance, we can observe that complete linkage gives better separated clusters.
  • A threshold value of 10 will give us 3 clusters.

Let's have a look at the dendrograms for different linkages with Mahalanobis distance

In [130]:
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average. 
# Hint: Use Mahalanobis distance as the metric in the linkage() function 

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Mahalanobis", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Mahalanobis distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

  • As with the Cityblock distance, we can observe that complete linkage gives better separated clusters.
  • A threshold value of 20 will give us 3 clusters.

Let's have a look at the dendrograms for different linkages with Euclidean distance

In [131]:
# Plot the dendrogram for Euclidean distance with linkages single, complete and average. 
# Hint: Use Euclidean distance as the metric in the linkage() function 

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Euclidean distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Think about it:

  • Are there any distinct clusters in any of the dendrograms?

Observations and Insights:

  • As with the Cityblock distance, we can observe that complete linkage gives better separated clusters.
  • A threshold value of 15 will give us 3 clusters.
In [132]:
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean and complete linkage, with clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')

# Fit on data_pca
HCmodel.fit(data_pca)                                                                
Out[132]:
AgglomerativeClustering(linkage='complete', n_clusters=3)
In [133]:
# Add Agglomerative Clustering cluster labels to data_pca
data_pca["HCLabels"] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data["HCLabels"] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model["HCLabels"] = HCmodel.labels_
In [134]:
# Let's check the distribution
data.HCLabels.value_counts()
Out[134]:
0    2225
2       4
1       1
Name: HCLabels, dtype: int64

Let's visualize the clusters using PCA.

In [135]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "HCLabels")

Cluster Profiling¶

In [136]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
hc_clustering_segment = data.groupby('HCLabels').mean()
In [137]:
# Highlight the maximum average value among all the clusters for each of the variables
hc_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
Out[137]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
HCLabels                                                                
0 1968.880000 51684.222921 0.444494 0.508315 49.117753 305.523146 26.393708 164.082697 37.703820 27.208090 44.181124 2.317303 4.107416 2.628764 5.822472 5.330787 0.073258 0.074607 0.072809 0.064719 0.013034 0.008989 0.150112 47.120000 0.952809 1.644494 2.597303 605.092584 14.875955 538.098427 0.448539 32.531491
1 1978.000000 51369.000000 0.000000 0.000000 53.000000 32.000000 2.000000 1607.000000 12.000000 4.000000 22.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38.000000 0.000000 2.000000 2.000000 1679.000000 1.000000 754.000000 1.000000 1679.000000
2 1977.750000 119409.750000 0.250000 0.250000 43.500000 19.250000 4.750000 1663.500000 5.250000 1.750000 1.750000 11.250000 0.000000 26.500000 0.250000 0.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38.250000 0.500000 2.000000 2.500000 1696.250000 38.000000 638.000000 0.000000 46.125115

Let's plot the boxplot

In [138]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

sns.set(font_scale=2)

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['HCLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
sns.set(font_scale=1)

Characteristics of each cluster¶

Cluster 0: Summary for cluster 0:

  • This cluster has ~99% of the data, which is not an appropriate distribution of data across clusters.

Cluster 1: Summary for cluster 1:

  • This cluster represents only 1 customer, which is not an appropriate distribution of data across clusters.

Cluster 2: Summary for cluster 2:

  • This cluster represents only 4 customers, which is not an appropriate distribution of data across clusters.

Observations and Insights:

In [139]:
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("HCLabels", axis=1, inplace=True)
data.drop("HCLabels", axis=1, inplace=True)

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge lies in determining the best set of hyperparameters. It has two key hyperparameters: eps (the neighborhood radius) and min_samples (the minimum number of points required to form a dense region).

Since DBSCAN is an unsupervised algorithm, we cannot validate it the way we would a supervised model on a held-out set. The approach we can follow is to try out a number of different hyperparameter combinations and compute the silhouette score for each of them.
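Beyond a plain grid search, a common heuristic for picking a starting eps is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the elbow of the sorted curve. The sketch below illustrates the idea on synthetic data; make_blobs is only a stand-in for data_pca, and the second-difference elbow detection is a simplifying assumption, not part of the original notebook.

```python
# Hypothetical illustration: the k-distance heuristic for choosing eps.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)  # stand-in for data_pca

k = 6  # match a candidate min_samples value
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)          # shape (n_samples, k)
k_dist = np.sort(distances[:, -1])       # distance to the k-th neighbor, sorted ascending

# A crude elbow proxy: the index of maximum curvature (largest second difference)
elbow = np.argmax(np.diff(k_dist, 2))
print("suggested eps near", round(k_dist[elbow], 3))
```

In practice one would plot k_dist and eyeball the elbow rather than trust the second-difference proxy.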

In [140]:
# Initializing lists
eps_value = [2,3]                       # Taking random eps value
min_sample_values = [6,20]              # Taking random min_sample value

# Creating a dictionary mapping each eps value to the list of min_sample values
res = {eps: min_sample_values for eps in eps_value}
In [141]:
from sklearn.cluster import DBSCAN
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                     # Assigning 0's to the high_i_j list
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # Keep track of the highest score and its (eps, min_samples) combination
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.19774043234901528
For eps value =2 For min sample =20 The average silhouette_score is : 0.33772677532880413
For eps value =3 For min sample =6 The average silhouette_score is : 0.3441301295900913
For eps value =3 For min sample =20 The average silhouette_score is : 0.3422032926810415
In [142]:
# Printing the highest silhouette score
print("Highest silhouette_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette_avg is 0.3441301295900913 for eps = 3 and min sample = 6

Now, let's apply DBSCAN using the hyperparameter values we have received above.

In [143]:
# Apply DBSCAN using the eps value selected above
dbs = DBSCAN(eps = 3)  # Note: min_samples is left at the scikit-learn default of 5 here
In [144]:
# fit_predict on data_pca once, then reuse the DBSCAN cluster labels
dbs_labels = dbs.fit_predict(data_pca)

# Add DBSCAN cluster labels to the whole data
data['DBSLabels'] = dbs_labels

# Add DBSCAN cluster labels to data_model
data_model['DBSLabels'] = dbs_labels

# Add DBSCAN cluster labels to data_pca
data_pca['DBSLabels'] = dbs_labels
In [145]:
# Let's check the distribution
data.DBSLabels.value_counts()
Out[145]:
 0    2072
-1     158
Name: DBSLabels, dtype: int64

Let's visualize the clusters using PCA.

In [146]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "DBSLabels")

Observations and Insights:

  • DBSCAN returns only 2 labels: one cluster (0) and the noise label (-1), so it effectively found a single dense cluster. This is fewer clusters than we have seen with the other algorithms.

Think about it:

  • Will changing the eps and min_samples values result in different DBSCAN results? Can we try more values for eps and min_samples?

Note: You can experiment with different eps and min_sample values to see if DBSCAN produces good distribution and cluster profiles.
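When experimenting, keep in mind how DBSCAN reports its results: label -1 marks noise points rather than a cluster, so cluster counting should exclude it. A minimal sketch on synthetic data (the dataset and the parameter values here are illustrative assumptions, not taken from the notebook):

```python
# Illustration: counting DBSCAN clusters while excluding the noise label (-1)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# -1 is not a cluster, so subtract it from the distinct-label count if present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")
```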

In [147]:
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("DBSLabels", axis=1, inplace=True)
data.drop("DBSLabels", axis=1, inplace=True)

Gaussian Mixture Model¶

Let's find the silhouette score for K=5 in Gaussian Mixture

In [148]:
gmm = GaussianMixture(n_components = 5, random_state = 1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1

preds = gmm.fit(data_pca).predict(data_pca)            # Fit and predict Gaussian Mixture Model using data_pca

score = silhouette_score(data_pca, preds)           # Calculate the silhouette score

print(score)                   # Print the score
0.14627588259256083
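The silhouette score above is computed for a fixed K=5. For mixture models, an alternative not used in this notebook is to choose the number of components by BIC, which trades off likelihood against model complexity (lower BIC is better). A hedged sketch, with synthetic data standing in for data_pca:

```python
# Sketch (assumption, not the notebook's method): selecting the number of
# GMM components via the Bayesian Information Criterion.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=4, random_state=1)  # stand-in for data_pca

bics = {}
for k in range(2, 8):
    gm = GaussianMixture(n_components=k, random_state=1).fit(X)
    bics[k] = gm.bic(X)  # lower is better

best_k = min(bics, key=bics.get)
print("BIC-preferred number of components:", best_k)
```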

Observations and Insights:

In [149]:
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to the whole data
data["GMMLabels"] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_model
data_model["GMMLabels"] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_pca
data_pca["GMMLabels"] = preds
In [150]:
# Let's check the distribution
data.GMMLabels.value_counts()
Out[150]:
1    759
4    606
3    449
0    340
2     76
Name: GMMLabels, dtype: int64

Let's visualize the clusters using PCA.

In [151]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "GMMLabels")

Cluster Profiling¶

In [152]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
gmm_clustering_segment = data.groupby('GMMLabels').mean()
In [153]:
# Highlight the maximum average value among all the clusters for each of the variables
gmm_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
Out[153]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
GMMLabels                                                                
0 1966.920588 63390.779412 0.170588 0.726471 45.932353 570.997059 45.547059 232.017647 60.832353 50.879412 78.338235 3.529412 6.732353 5.052941 9.111765 5.591176 0.102941 0.123529 0.052941 0.076471 0.032353 0.011765 0.194118 49.079412 0.897059 1.608824 2.505882 1038.611765 24.426471 642.961765 0.582353 42.887650
1 1971.996047 32166.699605 0.852437 0.444005 48.245059 20.276680 3.067194 12.928854 4.731225 3.280632 10.223979 1.772069 1.633729 0.316206 2.889328 6.397892 0.071146 0.003953 0.000000 0.000000 0.002635 0.013175 0.076416 44.003953 1.296443 1.646904 2.943347 54.508564 6.611331 484.523057 0.154150 7.977926
2 1968.486842 72561.671053 0.144737 0.276316 48.500000 805.065789 60.500000 446.223684 78.486842 58.157895 82.368421 1.276316 3.815789 4.434211 5.486842 4.881579 0.131579 0.355263 0.526316 0.355263 0.118421 0.000000 0.526316 47.513158 0.421053 1.592105 2.013158 1530.802632 15.013158 607.592105 2.013158 121.862778
3 1967.632517 75883.634744 0.020045 0.173719 50.864143 596.213808 65.120267 458.532294 97.810690 65.530067 74.955457 1.093541 5.102450 5.755011 8.285078 2.396437 0.073497 0.102450 0.229399 0.182628 0.011136 0.006682 0.224944 48.367483 0.193764 1.645880 1.839644 1358.162584 20.236080 513.648107 0.824053 67.764551
4 1967.123762 49459.694719 0.437294 0.740924 50.750825 233.471947 11.709571 74.013201 16.118812 11.414191 39.640264 3.412541 4.998350 1.775578 5.823432 6.037954 0.051155 0.080858 0.001650 0.014851 0.003300 0.004950 0.113861 48.876238 1.178218 1.669967 2.848185 386.367987 16.009901 556.782178 0.265677 22.972221

Let's plot the boxplots

In [154]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

sns.set(font_scale=2)

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['GMMLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
sns.set(font_scale=1)

Characteristics of each cluster¶

Cluster 0: Summary for cluster 0:

  • This cluster does not have the highest mean income, but it contains the customers with the highest incomes (seen as outliers in the boxplots).
  • It also does not have the highest average spend across categories, but it contains the highest individual spends (seen as outliers).
  • This cluster represents the maximum total purchases.
  • This cluster has the highest store purchases.

Cluster 1: Summary for cluster 1:

  • This cluster represents the lower income group.
  • This cluster has the least amount spent and the least number of purchases.

Cluster 2: Summary for cluster 2:

  • This cluster represents a high income group.
  • This cluster spends the maximum amount on Wine and Gold products.
  • This cluster has the highest amount per purchase.
  • This cluster has accepted the maximum number of offers.

Cluster 3: Summary for cluster 3:

  • This cluster represents the highest income group.
  • This cluster spends the maximum amount across most product categories.
  • This cluster purchases through all channels and has the least number of web visits per month.

Cluster 4: Summary for cluster 4:

  • This cluster represents a lower-to-middle income group.
  • This cluster has a high number of children (Kids + Teens) at home.
  • This cluster spends relatively little across all products.
  • This cluster also prefers deals.

Conclusion and Recommendations¶

  • Refined Insights: What are the most meaningful insights from the data relevant to the problem?
  • The data can be divided into 5 segments considering the following factors:
    • Income
    • Children at home
    • Spending categories
    • Preferred channel
  • With 5 clusters, the data can be distributed fairly evenly across all segments.
  • Comparison of various techniques and their relative performance: How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
  • K-Means did not provide appropriate clustering when tried with cluster sizes of 3 and 5.
  • The Hierarchical and DBSCAN clustering techniques did not provide evenly distributed clusters.
  • The GMM technique spread the data fairly evenly across 5 clusters, but the clusters still had overlaps and limited business significance, especially in terms of customer profiles.
  • The K-Medoids technique provided the most meaningful clustering of the given dataset.
  • Proposal for the final solution design: What model do you propose to be adopted? Why is this the best solution to adopt?
  • The final solution design is to use PCA for dimensionality reduction and K-Medoids for clustering. This solution holds good for the given dataset.
  • However, in the future, the dataset can be analyzed further to recommend changes to the approach.
  • This solution was chosen for the given dataset based on analyzing the results obtained from the other unsupervised learning techniques.
In [155]:
! jupyter nbconvert --to html notebook.ipynb

In [ ]: